The term xG in football stands for ‘expected goals’. It is a statistical measurement of the quality of goalscoring chances and the likelihood of them being scored.

The goal of this task is to use shot and freeze-frame data to identify the best predictors of goals, and to build a new metric (xG) that estimates the probability of a shot being scored.

Data Loading:

First, the two data frames used for this task are loaded: shots.df (shot data) and frames.df (freeze-frame data giving the location and position of every player other than the shooter at the moment of each shot).

library(tidyverse)  # dplyr verbs and %>%, purrr::modify_if(), ggplot2
library(reshape2)   # melt()
shots.df <- read.csv("./shots_df.csv")
frames.df <- read.csv("./shots_freeze_frames_df.csv")

Data Exploration:

The data is explored by computing summary statistics for both data frames.

summary(shots.df)
##       id                period       timestamp             minute     
##  Length:2816        Min.   :1.000   Length:2816        Min.   : 0.00  
##  Class :character   1st Qu.:1.000   Class :character   1st Qu.:26.00  
##  Mode  :character   Median :2.000   Mode  :character   Median :49.00  
##                     Mean   :1.548                      Mean   :48.51  
##                     3rd Qu.:2.000                      3rd Qu.:72.00  
##                     Max.   :2.000                      Max.   :96.00  
##                                                                       
##      second        possession       duration        location        
##  Min.   : 0.00   Min.   :  2.0   Min.   :0.0018   Length:2816       
##  1st Qu.:14.00   1st Qu.: 59.0   1st Qu.:0.4652   Class :character  
##  Median :29.00   Median :112.0   Median :0.9420   Mode  :character  
##  Mean   :29.31   Mean   :109.8   Mean   :0.9960                     
##  3rd Qu.:44.00   3rd Qu.:160.0   3rd Qu.:1.4098                     
##  Max.   :59.00   Max.   :241.0   Max.   :5.3096                     
##                                                                     
##  under_pressure    type.id    type.name         possession_team.id
##  Mode:logical   Min.   :16   Length:2816        Min.   :746.0     
##  TRUE:427       1st Qu.:16   Class :character   1st Qu.:966.0     
##  NA's:2389      Median :16   Mode  :character   Median :969.0     
##                 Mean   :16                      Mean   :938.6     
##                 3rd Qu.:16                      3rd Qu.:971.0     
##                 Max.   :16                      Max.   :974.0     
##                                                                   
##  possession_team.name play_pattern.id play_pattern.name     team.id     
##  Length:2816          Min.   :1.000   Length:2816        Min.   :746.0  
##  Class :character     1st Qu.:1.000   Class :character   1st Qu.:966.0  
##  Mode  :character     Median :2.000   Mode  :character   Median :969.0  
##                       Mean   :2.826                      Mean   :938.7  
##                       3rd Qu.:4.000                      3rd Qu.:971.0  
##                       Max.   :9.000                      Max.   :974.0  
##                                                                         
##   team.name           player.id     player.name         position.id  
##  Length:2816        Min.   : 4633   Length:2816        Min.   : 2.0  
##  Class :character   1st Qu.:10188   Class :character   1st Qu.:12.0  
##  Mode  :character   Median :15579   Mode  :character   Median :17.0  
##                     Mean   :13483                      Mean   :15.9  
##                     3rd Qu.:16379                      3rd Qu.:22.0  
##                     Max.   :24747                      Max.   :25.0  
##                                                                      
##  position.name      shot.statsbomb_xg  shot.end_location  shot.key_pass_id  
##  Length:2816        Min.   :0.005823   Length:2816        Length:2816       
##  Class :character   1st Qu.:0.022912   Class :character   Class :character  
##  Mode  :character   Median :0.048040   Mode  :character   Mode  :character  
##                     Mean   :0.102732                                        
##                     3rd Qu.:0.115396                                        
##                     Max.   :0.887769                                        
##                                                                             
##  shot.one_on_one shot.aerial_won shot.technique.id shot.technique.name
##  Mode:logical    Mode:logical    Min.   :89.00     Length:2816        
##  TRUE:151        TRUE:186        1st Qu.:93.00     Class :character   
##  NA's:2665       NA's:2630       Median :93.00     Mode  :character   
##                                  Mean   :92.96                        
##                                  3rd Qu.:93.00                        
##                                  Max.   :95.00                        
##                                                                       
##  shot.outcome.id  shot.outcome.name   shot.type.id   shot.type.name    
##  Min.   : 96.00   Length:2816        Min.   :62.00   Length:2816       
##  1st Qu.: 97.00   Class :character   1st Qu.:87.00   Class :character  
##  Median : 98.00   Mode  :character   Median :87.00   Mode  :character  
##  Mean   : 98.18                      Mean   :86.16                     
##  3rd Qu.:100.00                      3rd Qu.:87.00                     
##  Max.   :116.00                      Max.   :88.00                     
##                                                                        
##  shot.body_part.id shot.body_part.name    match_id     competition_id
##  Min.   :37.00     Length:2816         Min.   :19714   Min.   :37    
##  1st Qu.:38.00     Class :character    1st Qu.:19739   1st Qu.:37    
##  Median :40.00     Mode  :character    Median :19765   Median :37    
##  Mean   :39.03                         Mean   :19766   Mean   :37    
##  3rd Qu.:40.00                         3rd Qu.:19793   3rd Qu.:37    
##  Max.   :70.00                         Max.   :19822   Max.   :37    
##                                                                      
##    season_id shot.open_goal shot.first_time shot.redirect  shot.deflected
##  Min.   :4   Mode:logical   Mode:logical    Mode:logical   Mode:logical  
##  1st Qu.:4   TRUE:29        TRUE:446        TRUE:11        TRUE:25       
##  Median :4   NA's:2787      NA's:2370       NA's:2805      NA's:2791     
##  Mean   :4                                                               
##  3rd Qu.:4                                                               
##  Max.   :4                                                               
##                                                                          
##  shot.saved_to_post   location.x      location.y    shot.end_location.x
##  Mode:logical       Min.   : 58.0   Min.   : 5.00   Min.   : 84        
##  TRUE:4             1st Qu.: 97.0   1st Qu.:34.00   1st Qu.:115        
##  NA's:2812          Median :105.7   Median :41.00   Median :119        
##                     Mean   :103.7   Mean   :40.62   Mean   :116        
##                     3rd Qu.:111.0   3rd Qu.:47.83   3rd Qu.:120        
##                     Max.   :120.0   Max.   :78.60   Max.   :120        
##                                                                        
##  shot.end_location.y shot.end_location.z
##  Min.   : 2.00       Min.   :0.000      
##  1st Qu.:36.60       1st Qu.:0.600      
##  Median :40.10       Median :1.300      
##  Mean   :40.29       Mean   :1.726      
##  3rd Qu.:43.80       3rd Qu.:2.400      
##  Max.   :80.00       Max.   :7.600      
##                      NA's   :853
summary(frames.df)
##       id              location.x      location.y     teammate      
##  Length:34476       Min.   :  2.0   Min.   : 0.00   Mode :logical  
##  Class :character   1st Qu.:100.0   1st Qu.:34.00   FALSE:22808    
##  Mode  :character   Median :106.0   Median :40.00   TRUE :11668    
##                     Mean   :104.9   Mean   :40.46                  
##                     3rd Qu.:113.0   3rd Qu.:47.00                  
##                     Max.   :120.0   Max.   :80.00                  
##    player.id     player.name         position.id    position.name     
##  Min.   : 4633   Length:34476       Min.   : 1.00   Length:34476      
##  1st Qu.:15554   Class :character   1st Qu.: 3.00   Class :character  
##  Median :15709   Mode  :character   Median :10.00   Mode  :character  
##  Mean   :14919                      Mean   :10.48                     
##  3rd Qu.:17275                      3rd Qu.:16.00                     
##  Max.   :24931                      Max.   :25.00

Data Cleaning:

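Missing values in the logical flag columns (which hold only TRUE or NA) are replaced with FALSE, and a binary target variable, is_goal, is created from the shot.outcome.name variable. Character columns are converted to factors later, during feature engineering.

# Replace NA with FALSE in the logical flag columns only, so numeric columns keep their NAs
flag.cols <- sapply(shots.df, is.logical)
shots.df[flag.cols][is.na(shots.df[flag.cols])] <- FALSE

# Target variable: 1 if the shot was a goal, 0 otherwise
shots.df$is_goal <- 0
shots.df$is_goal[shots.df$shot.outcome.name == "Goal"] <- 1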
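
As a small illustration of these two cleaning steps, the sketch below applies them to a made-up three-row data frame. `toy.shots` and its values are hypothetical and are not taken from the data set. Restricting the NA replacement to logical columns is an assumption of this sketch, made so that numeric columns keep their missing values.

```r
# Hypothetical toy data: the flag column is TRUE or NA, as in the summary above
toy.shots <- data.frame(
  shot.outcome.name   = c("Goal", "Saved", "Off T"),
  under_pressure      = c(TRUE, NA, NA),
  shot.end_location.z = c(1.2, NA, 3.0),
  stringsAsFactors    = FALSE
)

# Replace NA with FALSE in logical columns only
flag.cols <- sapply(toy.shots, is.logical)
toy.shots[flag.cols][is.na(toy.shots[flag.cols])] <- FALSE

# Binary target: 1 for a goal, 0 otherwise
toy.shots$is_goal <- as.integer(toy.shots$shot.outcome.name == "Goal")

toy.shots
```

The numeric column keeps its NA, while the flag column becomes a complete TRUE/FALSE indicator.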

Feature Engineering:

The two data frames are joined, and the features needed for the analysis are created.

Features to be used:
  1. Shot distance: the distance between the shot location and the centre of the goal line (120, 40).
  2. Shot angle: the angle subtended by the goal mouth at the shot location, calculated from the shot location and the width of the goal (7.32 meters).
  3. Opponents in space: the number of opponents who are closer to the centre of the goal than the shooter.
  4. Defenders in space: the number of those opponents whose position name contains "Defensive".
  5. Goalkeeper distance: the distance of the goalkeeper from the centre of the goal line.
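
To make the first two features concrete, here is a hedged sketch of the geometry for a single shot. It assumes StatsBomb pitch coordinates (120 x 80, goal centre at (120, 40)) in which the posts sit at y = 36 and y = 44, so the goal mouth is 8 units wide; 7.32 m is the same width expressed in metres. The helper names `shot_distance` and `shot_angle_deg` are illustrative only. The sketch uses `atan2()` rather than `atan()` so that shots from very close range, where the denominator turns negative, still get an angle between 0 and 180 degrees.

```r
# Assumed pitch geometry in StatsBomb units (not metres)
goal.x <- 120
goal.y <- 40
goal.width <- 8

# Euclidean distance from the shot location to the goal centre
shot_distance <- function(x, y) {
  sqrt((goal.x - x)^2 + (goal.y - y)^2)
}

# Angle (in degrees) subtended by the two posts at the shot location
shot_angle_deg <- function(x, y) {
  dx <- goal.x - x
  dy <- goal.y - y
  atan2(goal.width * dx, dx^2 + dy^2 - (goal.width / 2)^2) * 180 / pi
}

# A shot from (108, 40), i.e. 12 units straight in front of goal
shot_distance(108, 40)
shot_angle_deg(108, 40)
```

For a central shot the angle can be cross-checked as `2 * atan((goal.width / 2) / distance)`, which gives the same value.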
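cleaned.data <- shots.df %>%
  rename(x = location.x, y = location.y) %>%
  # Distance to the goal centre (120, 40) and angle subtended by the goal, in degrees.
  # Note: 7.32 is the goal width in metres, whereas StatsBomb pitch coordinates place the posts 8 units apart;
  # atan() also returns a negative angle when the denominator is negative (very close range).
  mutate(shot.distance = sqrt((120 - x)^2 + (40 - y)^2),
         shot.angle = atan(7.32 * (120 - x) / ((120 - x)^2 + (40 - y)^2 - (7.32/2)^2)) * 180/pi) %>%
  modify_if(is.character, as.factor)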

# Join the two data frames on shot id and derive features from the other players' positions
# (opponents closer to goal than the shooter, how many of them are "Defensive", goalkeeper distance).
joined.df <- merge(cleaned.data, frames.df, by = "id") %>%
  rename(other.x = location.x, other.y = location.y, other.position = position.name.y) %>%
  mutate(other.distance = sqrt((120 - other.x)^2 + (40 - other.y)^2),
         other.in.space = other.distance < shot.distance,
         opp.in.space = other.in.space & !teammate,
         # Matches position names containing "Defensive" (e.g. defensive midfield), not names such as "Center Back"
         defenders.in.space = opp.in.space & grepl("Defensive", other.position))

eng.df <- joined.df %>% 
  group_by(id,player.name.x,team.name,
           minute,second,possession,duration,under_pressure,
           play_pattern.name, shot.body_part.name,
           shot.technique.name, shot.type.name, match_id,
           shot.open_goal, shot.first_time, shot.redirect, shot.deflected, shot.saved_to_post, 
           shot.distance, shot.angle,shot.statsbomb_xg, is_goal) %>%
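  # Per-shot counts, plus the distance from the goal centre of any goalkeeper in the freeze frame.
  # A frame with two goalkeepers yields two rows for that shot; a frame with none drops the shot.
  summarize(sum.opp.in.space = sum(opp.in.space),
            sum.defenders.in.space = sum(defenders.in.space),
            goal_keeper_distance = other.distance[other.position == "Goalkeeper"])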
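
The per-shot features can be checked by hand on a tiny example. The sketch below uses a hypothetical freeze frame for one shot taken 12 units from goal (all values are invented) and base R instead of the dplyr pipeline. It restricts the goalkeeper lookup to the opposing team, which is an assumption of this sketch, so that a teammate goalkeeper in the frame does not produce a second value.

```r
shot.distance <- 12  # shooter's distance to the goal centre (hypothetical)

# Hypothetical freeze frame: five other players at the moment of the shot
toy.frame <- data.frame(
  other.x  = c(119, 112, 110, 100, 30),
  other.y  = c(40, 38, 45, 40, 40),
  teammate = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  other.position = c("Goalkeeper", "Center Back", "Center Forward",
                     "Center Defensive Midfield", "Goalkeeper"),
  stringsAsFactors = FALSE
)

# Distance of each player to the goal centre (120, 40)
toy.frame$other.distance <- sqrt((120 - toy.frame$other.x)^2 +
                                 (40 - toy.frame$other.y)^2)

# Opponents who are closer to the goal centre than the shooter
opp.in.space <- toy.frame$other.distance < shot.distance & !toy.frame$teammate
sum.opp.in.space <- sum(opp.in.space)

# Distance of the opposing goalkeeper only
is.opp.keeper <- toy.frame$other.position == "Goalkeeper" & !toy.frame$teammate
goal_keeper_distance <- toy.frame$other.distance[is.opp.keeper]

sum.opp.in.space
goal_keeper_distance
```

Here the opposing goalkeeper and the centre back are closer to goal than the shooter, the forward is a teammate, and the defensive midfielder is further away, so two opponents are counted.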

Data Exploration Again:

After feature engineering and joining the data frames, the data is explored again to discover relationships and correlations between the variables.

[Figure: correlation matrix]
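
The code behind the correlation matrix is not shown. A minimal sketch of one way to compute such a matrix is given below on simulated stand-in data: `toy.eng` and all of its values are invented so that the block runs on its own, and only the column names follow the engineered features.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the numeric columns of the engineered data
toy.eng <- data.frame(
  shot.distance        = runif(n, 1, 40),
  sum.opp.in.space     = rpois(n, 2),
  goal_keeper_distance = runif(n, 0, 10)
)
# Simulated outcome: closer shots are more likely to be goals
toy.eng$is_goal <- rbinom(n, 1, plogis(1 - 0.15 * toy.eng$shot.distance))

# Pairwise Pearson correlations between all numeric columns
corr.mat <- cor(toy.eng)
round(corr.mat, 2)
```

With the real data, the same `cor()` call could be applied to the numeric columns of eng.df after ungrouping.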

Data Modeling:

Next, a model is built to predict expected goals. The data is first split into a training set (80%) and a test set (20%). A logistic regression is then fitted to the training set to estimate the probability of each shot being scored. The features used to build this model are: shot.distance, shot.angle, sum.opp.in.space, sum.defenders.in.space, and goal_keeper_distance.

# Split the data into train and test sets
data <- eng.df
set.seed(101)
# Select 80% of the rows as training data (train.idx avoids shadowing base::sample)
train.idx <- sample.int(n = nrow(data), size = floor(0.8 * nrow(data)), replace = FALSE)
train <- data[train.idx, ]
test  <- data[-train.idx, ]

# Build the model: logistic regression
mod <- glm(is_goal ~ shot.distance  + shot.angle + sum.opp.in.space + sum.defenders.in.space + goal_keeper_distance, data=train, family=binomial)
summary(mod)
## 
## Call:
## glm(formula = is_goal ~ shot.distance + shot.angle + sum.opp.in.space + 
##     sum.defenders.in.space + goal_keeper_distance, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4192  -0.5105  -0.3226  -0.1902   3.8031  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.271496   0.239924  -1.132    0.258    
## shot.distance          -0.062306   0.012379  -5.033 4.82e-07 ***
## shot.angle             -0.001318   0.003136  -0.420    0.674    
## sum.opp.in.space       -0.252589   0.047312  -5.339 9.35e-08 ***
## sum.defenders.in.space -0.123378   0.192083  -0.642    0.521    
## goal_keeper_distance    0.095040   0.023477   4.048 5.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1490.9  on 2218  degrees of freedom
## Residual deviance: 1276.1  on 2213  degrees of freedom
## AIC: 1288.1
## 
## Number of Fisher Scoring iterations: 6
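
To show how these coefficients turn into a probability, the sketch below plugs the estimates printed above into the logistic function for one hypothetical shot. The shot's feature values are invented for illustration; only the coefficients come from the model summary.

```r
# Coefficient estimates copied from the summary above
coefs <- c(intercept = -0.271496, shot.distance = -0.062306,
           shot.angle = -0.001318, sum.opp.in.space = -0.252589,
           sum.defenders.in.space = -0.123378, goal_keeper_distance = 0.095040)

# Hypothetical shot (the leading 1 multiplies the intercept)
shot <- c(intercept = 1, shot.distance = 12, shot.angle = 30,
          sum.opp.in.space = 1, sum.defenders.in.space = 0,
          goal_keeper_distance = 3)

# Linear predictor (log-odds of scoring), then inverse logit
eta <- sum(coefs * shot[names(coefs)])
xg  <- plogis(eta)  # same as exp(eta) / (1 + exp(eta))

eta
xg
```

Read on the odds scale, each extra unit of shot distance multiplies the odds of scoring by exp(-0.062306), roughly 0.94, with the other features held fixed.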
# Extract the fitted coefficients (intercept first, then one per predictor)
coefs <- coef(mod)
int_coef <- coefs[1]
dist_coef <- coefs[2]
ang_coef <- coefs[3]
opp_coef <- coefs[4]
def_coef <- coefs[5]
gk_coef <- coefs[6]

# Assign an xG value to every shot: the inverse logit of the linear predictor
lin.pred <- int_coef + ang_coef*data$shot.angle + dist_coef*data$shot.distance +
  opp_coef*data$sum.opp.in.space + def_coef*data$sum.defenders.in.space +
  gk_coef*data$goal_keeper_distance
data$xG <- exp(lin.pred)/(1 + exp(lin.pred))
data
## # A tibble: 2,774 x 26
## # Groups:   id, player.name.x, team.name, minute, second, possession, duration,
## #   under_pressure, play_pattern.name, shot.body_part.name,
## #   shot.technique.name, shot.type.name, match_id, shot.open_goal,
## #   shot.first_time, shot.redirect, shot.deflected, shot.saved_to_post,
## #   shot.distance, shot.angle, shot.statsbomb_xg, is_goal [2,769]
##    id             player.name.x  team.name     minute second possession duration
##    <fct>          <fct>          <fct>          <int>  <int>      <int>    <dbl>
##  1 0024316a-8bbe~ Vivianne Mied~ Arsenal WFC       93     18        162    0.354
##  2 002e8652-613e~ Hannah Cain    Everton LFC       88     50        217    2.29 
##  3 003b0566-1e1d~ Nikita Parris  Manchester C~      3     23         10    0.351
##  4 0047b26b-5f8d~ Melissa Lawley Manchester C~     35     28         82    0.678
##  5 005aa8fd-7fc8~ Kayleigh Green Brighton & H~      2     18          9    1.85 
##  6 007260ed-38c4~ Christie Murr~ Liverpool WFC     14      4         35    1.20 
##  7 00772dae-2c12~ Abigail Harri~ Bristol City~     49     50         92    0.747
##  8 00781c4f-9579~ Angharad James Everton LFC       56     17        142    1.85 
##  9 00849624-27dd~ Nikita Parris  Manchester C~     14     38         28    1.38 
## 10 0084c484-a5ed~ Brooke Hendrix West Ham Uni~     56     43        120    0.874
## # ... with 2,764 more rows, and 19 more variables: under_pressure <lgl>,
## #   play_pattern.name <fct>, shot.body_part.name <fct>,
## #   shot.technique.name <fct>, shot.type.name <fct>, match_id <int>,
## #   shot.open_goal <lgl>, shot.first_time <lgl>, shot.redirect <lgl>,
## #   shot.deflected <lgl>, shot.saved_to_post <lgl>, shot.distance <dbl>,
## #   shot.angle <dbl>, shot.statsbomb_xg <dbl>, is_goal <dbl>,
## #   sum.opp.in.space <int>, sum.defenders.in.space <int>,
## #   goal_keeper_distance <dbl>, xG <dbl>
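
The split above produces a test set, but the listing does not show it being used. One common way to use held-out rows, sketched here on simulated data so that the block runs on its own, is to predict on them with `predict(..., type = "response")` and score the probabilities with the Brier score (the mean squared difference between predicted probability and outcome). The simulated data, the names `sim`, `sim.train`, `sim.test`, `sim.mod`, `brier` and `baseline`, and the choice of metric are all assumptions of this sketch and not part of the analysis above.

```r
set.seed(101)
n <- 1000

# Simulated shots: closer shots are more likely to be goals
sim <- data.frame(shot.distance = runif(n, 1, 40))
sim$is_goal <- rbinom(n, 1, plogis(1 - 0.15 * sim$shot.distance))

# 80/20 split, mirroring the split used above
train.idx <- sample.int(n, size = floor(0.8 * n))
sim.train <- sim[train.idx, ]
sim.test  <- sim[-train.idx, ]

# Fit on the training rows, predict probabilities on the held-out rows
sim.mod <- glm(is_goal ~ shot.distance, data = sim.train, family = binomial)
sim.test$xG <- predict(sim.mod, newdata = sim.test, type = "response")

# Brier score of the model versus always predicting the training goal rate
brier    <- mean((sim.test$xG - sim.test$is_goal)^2)
baseline <- mean((mean(sim.train$is_goal) - sim.test$is_goal)^2)
c(model = brier, baseline = baseline)
```

A lower Brier score is better, so comparing the model against the constant baseline indicates whether the predictors add information on unseen shots.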

Plotting and Visualization:

Next, some useful visualizations are created to plot the new metric (xG) against different variables and compare it to actual goals.

1. xG versus shot distance
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

2. xG vs. StatsBomb xG (shot.statsbomb_xg)

3. Shots distribution across xG:

A plot visualizing shots spread versus xG to see where the majority of shots lie

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4. Compare actual goals and expected goals across different matches for “Manchester City WFC”

A shot is counted as an expected goal when its xG is at least 0.3 (threshold = 0.3), and the counts are summed per match.

# Compare actual goals and expected goals across different matches for "Manchester City WFC"
# at threshold = 0.3
actual.to.xG <- data %>%
  filter(team.name == "Manchester City WFC") %>%
  group_by(match_id) %>%
  mutate(xG.as.goals = ifelse(xG >= 0.3, 1, 0)) %>%
  summarise(actual.goals = sum(is_goal), expected.goals = sum(xG.as.goals)) %>%
  melt(id.vars = "match_id") %>%
  ggplot(aes(match_id, value, color = variable)) +
  geom_smooth() +
  labs(title = "Actual Goals vs. Expected Goals for Manchester City WFC", x = "Match ID", y = "Goals Count")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
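
The plot above counts a shot as an expected goal only when its xG reaches 0.3. Since each xG value is a probability, another common aggregation is to sum the xG values per match. The sketch below contrasts the two on invented numbers (`toy.xg` is hypothetical). It is offered as an alternative view and is not what the code above computes.

```r
# Hypothetical shots from two matches
toy.xg <- data.frame(
  match_id = c(1, 1, 1, 2, 2),
  xG       = c(0.05, 0.35, 0.10, 0.25, 0.25),
  is_goal  = c(0, 1, 0, 1, 0)
)

# Thresholded version, as in the plot above
toy.xg$xG.as.goals <- as.numeric(toy.xg$xG >= 0.3)

# Per match: actual goals, thresholded count, and summed xG
per.match <- aggregate(cbind(is_goal, xG.as.goals, xG) ~ match_id,
                       data = toy.xg, FUN = sum)
per.match
```

In this toy example both matches have a summed xG of 0.5, but the thresholded count is 1 for the first match and 0 for the second, because no single shot in the second match reaches 0.3.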